Follow the User?!

Data Donation Studies for Collecting Digital Trace Data


Session 3️⃣: Data Donation Studies (Researcher Perspective)

Frieder Rodewald (University of Mannheim) & Valerie Hase (LMU Munich)


👉 Part of the SPP DFG Project Integrating Data Donations in Survey Infrastructure

Which methodological decisions do researchers have to make in data donation studies? 🤔

Data donation study - researcher perspective

process of data donation study

Figure. Data donation study - researcher perspective

Agenda

  1. Research design & tool set-up, including

    📢 Task 3: Modify the data donation tool

  2. Data cleaning & augmentation, including

    📢 Task 4: Classify search terms

  3. Modelling digital traces

Image by Hope House Press via Unsplash

1) Research design & tool set-up

Image of a magnifying glass

Source: Image by Markus Winkler via Unsplash

Step I: Research design & tool set-up

process of data donation study

Figure. Data donation study - researcher perspective

Step I: Research design & tool set-up

Key decisions:

  • Which theoretical questions do I want to answer?
  • How do I operationalize key variables via my data donation tool?
  • How do I integrate the tool in surveys & recruit participants?

Step I.I Which questions do I want to answer?

This may sound silly but:

  • Novel method, few empirical applications
  • To date: methodological playground
  • What good is a method that is not used to advance theories/empirical knowledge?

Step I.II: How do I operationalize key variables?

Choose a tool, e.g., …

Step I.II: How do I operationalize key variables?

  • Participants “upload” data
  • Local extraction, anonymization, & aggregation
  • Users can delete data
  • Informed consent, only then: send to researcher server

Step I.II: How do I operationalize key variables?

Extraction🔎:

Files in data donation packages

Figure. Filtering data - File extraction

Step I.II: How do I operationalize key variables?

Extraction🔎:

Python code for extracting files

Figure. Filtering data - Python code
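
The slide's code is only available as a figure, so here is a minimal sketch of what local file extraction could look like, assuming the data donation package is a ZIP archive that contains JSON files. The function names and the file paths inside the demo package are hypothetical and not taken from any platform's export or from a specific tool.

```python
import io
import json
import zipfile


def extract_json_files(ddp_bytes, wanted_suffix):
    """Parse every JSON file in the package whose path ends with wanted_suffix."""
    extracted = {}
    with zipfile.ZipFile(io.BytesIO(ddp_bytes)) as package:
        for name in package.namelist():
            # Everything that does not match is never read at all.
            if name.endswith(wanted_suffix):
                with package.open(name) as handle:
                    extracted[name] = json.load(handle)
    return extracted


def build_demo_package():
    """Create a tiny in-memory ZIP that mimics a package (hypothetical file names)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr(
            "history/search-history.json",
            json.dumps([{"title": "Searched for flu symptoms"}]),
        )
        package.writestr(
            "history/watch-history.json",
            json.dumps([{"title": "Watched a video"}]),
        )
        package.writestr("profile/photo.txt", "not needed for the study")
    return buffer.getvalue()


demo_package = build_demo_package()
searches = extract_json_files(demo_package, "search-history.json")
print(sorted(searches))
```

Filtering by file name keeps all other files in the package untouched, which matches the idea of extracting only what the research question needs.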

Step I.II: How do I operationalize key variables?

Anonymization 🙈:

Example whitelists for news outlets

Figure. Anonymization - Example of Whitelists

Step I.II: How do I operationalize key variables?

Anonymization 🙈:

Example of anonymized data

Figure. Example of anonymized data
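
A minimal sketch of whitelist-based anonymization, assuming visited URLs are reduced to a whitelisted domain and everything else is replaced by a placeholder. The whitelist entries, the placeholder label, and the function name are illustrative assumptions, not the whitelists shown in the figure.

```python
from urllib.parse import urlparse

# Hypothetical whitelist of news outlets (illustration only).
WHITELIST = {"tagesschau.de", "spiegel.de"}


def anonymize_url(url, whitelist=WHITELIST, placeholder="[other]"):
    """Return the whitelisted domain for a URL, or a placeholder otherwise."""
    host = (urlparse(url).hostname or "").lower()
    for domain in whitelist:
        # Match the domain itself and any of its subdomains (www., m., ...).
        if host == domain or host.endswith("." + domain):
            return domain
    return placeholder


print(anonymize_url("https://www.tagesschau.de/inland/artikel-1.html"))
print(anonymize_url("https://my-private-blog.example/post"))
```

Only the domain leaves the participant's device in this sketch. Paths and query strings, which may contain personal information, are dropped even for whitelisted outlets.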

Step I.II: How do I operationalize key variables?

Aggregation 🧮:

Python code for aggregation

Figure. Aggregation - Python code
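
A minimal sketch of local aggregation, assuming each extracted event has an ISO 8601 timestamp and an already anonymized domain. The field names `timestamp`, `domain`, and `n_visits` are assumptions made for this illustration.

```python
from collections import Counter
from datetime import datetime


def aggregate_per_day(events):
    """Count events per day and domain instead of keeping single events."""
    counts = Counter()
    for event in events:
        day = datetime.fromisoformat(event["timestamp"]).date().isoformat()
        counts[(day, event["domain"])] += 1
    return [
        {"day": day, "domain": domain, "n_visits": n}
        for (day, domain), n in sorted(counts.items())
    ]


events = [
    {"timestamp": "2024-05-01T08:15:00", "domain": "tagesschau.de"},
    {"timestamp": "2024-05-01T21:40:00", "domain": "tagesschau.de"},
    {"timestamp": "2024-05-02T07:05:00", "domain": "spiegel.de"},
]
daily = aggregate_per_day(events)
print(daily)
```

Aggregating to daily counts removes exact timestamps, so the donated data are less detailed but also less sensitive.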

Step I.II: How do I operationalize key variables?

Data deletion by users ❌:

Example of how users can delete their data

Figure. Data deletion
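
The logic of deletion and consent can be sketched as follows, assuming each extracted row has an `id` and that participants mark the rows they want to remove before consenting. The function and field names are hypothetical. Real tools implement this step in their user interface.

```python
def prepare_submission(rows, deleted_ids, consent_given):
    """Return only the rows that may be sent to the researcher's server."""
    if not consent_given:
        # Without informed consent, nothing leaves the participant's device.
        return []
    return [row for row in rows if row["id"] not in deleted_ids]


rows = [
    {"id": 1, "query": "weather tomorrow"},
    {"id": 2, "query": "something private"},
    {"id": 3, "query": "news today"},
]
submission = prepare_submission(rows, deleted_ids={2}, consent_given=True)
print([row["id"] for row in submission])
```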

Step I.II: How do I operationalize key variables?

This is how much “fun” testing data donation tools (DDTs) is:

Github screenshot of testing

Figure. Github issues - Testing the tool

Step I.II: How do I operationalize key variables?

Key issues 🚨 (Hase et al., 2024)

  • Missing documentation by platforms (e.g., file structure)
  • Sudden changes in data donation packages (DDPs)
  • Differences across languages & devices
  • Insufficient in-tool classification

Let’s have a look at the technical set-up.

📢 Task 3: Modify the data donation tool

Frieder: can we ask them to change filtering scripts, etc.? YouTube-URL extraction?

Feel free to work in groups of 2-3 people.

Step I.III: How do I integrate the tool in surveys & recruit participants?

  • Often: Survey, then forwarding to an external site
  • Less often: Integration in existing survey infrastructure (Haim et al., 2023)

Step I.III: How do I integrate the tool in surveys & recruit participants?

  • Low response rates (e.g., Hase & Haim, 2024; Keusch et al., 2024)

    • Behavioral intentions (“willingness to donate”) are high (52–79% of survey respondents)
    • Actual behavior (“participation in data donation”) is low (12–37% of survey respondents)
    • Well-known intention-behavior gap (Kmetty & Stefkovics, 2025)
  • Non-response bias

  • Primarily used in non-probability panels (e.g., online access panels)

  • Survey design strategies: For now, 🤑 is the only thing that works.

  • 👉 Again, we will talk about this in session 4️⃣.

Step I: Research design & tool set-up

process of data donation study

Figure. Data donation study - researcher perspective

Step II: Data cleaning & augmentation

process of data donation study

Figure. Data donation study - researcher perspective

Step II.I: How do I clean and extend data?

This is what your data may look like:

Example of donated data

Figure. Donated data - example

Step II.I: How do I clean and extend data?

This is what your data may look like:

process of data donation study

Figure. Donated data - example

Step II.I: How do I clean and extend data?

  • Manual annotation by participants during data donation
  • APIs/scraping to extend collected data
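
Before donated data can be extended via an API, records usually need a stable identifier. As a minimal sketch, assuming donated watch histories contain YouTube URLs in the common `watch?v=` or `youtu.be/` form, the video ID could be extracted like this. The function name and the example IDs are hypothetical, and the actual API request is left out on purpose.

```python
from urllib.parse import parse_qs, urlparse


def youtube_video_id(url):
    """Extract the video ID from a YouTube URL, or return None if there is none."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID in the path.
        return parsed.path.lstrip("/") or None
    if (host == "youtube.com" or host.endswith(".youtube.com")) and parsed.path == "/watch":
        # Regular links carry the ID in the "v" query parameter.
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None


print(youtube_video_id("https://www.youtube.com/watch?v=abc123XYZ_0&t=10s"))
print(youtube_video_id("https://youtu.be/abc123XYZ_0"))
```

The extracted IDs could then be used to request metadata such as titles or channel names, subject to the platform's terms and rate limits.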

📢 Task 4: Classify search terms

Download the data for Task 4 from the workshop website. The data set contains YouTube searches collected from a German social media sample. Either discuss the following questions or try them out in R/Python:

  1. How would you clean the data?

  2. How would you identify health-related searches using NLP methods?

Example of YouTube searches

Figure. Donated data - example
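
One possible starting point for the task is a simple dictionary baseline: normalize the search terms, then flag searches that contain a health-related keyword. This is a sketch under strong assumptions. The keyword list is a small, made-up illustration, and the function names are hypothetical. More advanced NLP methods (e.g., supervised classifiers) would need to be validated against manual coding.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny dictionary of German health terms.
HEALTH_TERMS = {"grippe", "symptome", "arzt", "kopfschmerzen", "impfung"}


def clean_query(query):
    """Lowercase, normalize Unicode, strip punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKC", query).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def is_health_related(query, terms=HEALTH_TERMS):
    """Flag a search if any of its tokens appears in the dictionary."""
    return any(token in terms for token in clean_query(query).split())


queries = ["  Grippe   Symptome?! ", "Minecraft Tutorial deutsch"]
print([(clean_query(q), is_health_related(q)) for q in queries])
```

Exact token matching misses compounds and inflected forms, which are common in German, so lemmatization or substring matching may be worth discussing as part of the cleaning step.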

Step II.II: How do I check for bias?

👉 You know the drill: We will talk about this in session 4️⃣.

Step II: Data cleaning & augmentation

process of data donation study

Figure. Data donation study - researcher perspective

Step III: Modelling

process of data donation study

Figure. Data donation study - researcher perspective

Step III.I: How do I analyze results?

Think carefully about…

  • How to create indices from different metrics (e.g., liking, sharing, or commenting on content)
  • Hierarchical structure (nested in time, metrics, platforms)
  • Skewed data, non-linearity
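
As a minimal sketch of the first and third point, one way to build an index from several skewed count metrics is to log-transform each metric, standardize it, and average across metrics. The metric names, the example values, and the choice of `log1p` plus z-scores are assumptions for illustration. Other transformations or weighting schemes may fit a given study better.

```python
import math
import statistics


def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(value - mean) / sd for value in values]


def engagement_index(participants, metrics=("likes", "shares", "comments")):
    """Average the standardized, log-transformed metrics per participant."""
    standardized = []
    for metric in metrics:
        # log1p dampens the right skew typical of count data and handles zeros.
        logged = [math.log1p(person[metric]) for person in participants]
        standardized.append(z_scores(logged))
    return [statistics.mean(column) for column in zip(*standardized)]


participants = [
    {"likes": 0, "shares": 0, "comments": 1},
    {"likes": 3, "shares": 1, "comments": 2},
    {"likes": 250, "shares": 40, "comments": 90},
]
index = engagement_index(participants)
print(index)
```

This sketch ignores the hierarchical structure of the data. If observations are nested in time or platforms, multilevel models are a natural next step.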

Summary: Researcher perspective 📚

  • Key steps include:

    1. Research design & tool set-up
    2. Data cleaning & augmentation
    3. Modelling
  • Further literature:

    • Boeschoten et al. (2022)

    • Carrière et al. (2024)

Questions? 🤔

References

Boeschoten, L., Mendrik, A., Van Der Veen, E., Vloothuis, J., Hu, H., Voorvaart, R., & Oberski, D. L. (2022). Privacy-preserving local analysis of digital trace data: A proof-of-concept. Patterns, 3(3), 100444. https://doi.org/10.1016/j.patter.2022.100444
Boeschoten, L., Schipper, N. C. de, Mendrik, A. M., Veen, E. van der, Struminskaya, B., Janssen, H., & Araujo, T. (2023). Port: A software tool for digital data donation. Journal of Open Source Software, 8(90), 5596.
Carrière, T. C., Boeschoten, L., Struminskaya, B., Janssen, H. L., De Schipper, N. C., & Araujo, T. (2024). Best practices for studies using digital data donation. Quality & Quantity. https://doi.org/10.1007/s11135-024-01983-x
Haim, M., Leiner, D., & Hase, V. (2023). Integrating Data Donations into Online Surveys. Medien & Kommunikationswissenschaft, 71(1-2), 130–137. https://doi.org/10.5771/1615-634X-2023-1-2-130
Hase, V., Ausloos, J., Boeschoten, L., Pfiffner, N., Janssen, H., Araujo, T., Carrière, T., De Vreese, C., Haßler, J., Loecherbach, F., Kmetty, Z., Möller, J., Ohme, J., Schmidbauer, E., Struminskaya, B., Trilling, D., Welbers, K., & Haim, M. (2024). Fulfilling Data Access Obligations: How Could (and Should) Platforms Facilitate Data Donation Studies? Internet Policy Review, 13(3). https://doi.org/10.14763/2024.3.1793
Hase, V., & Haim, M. (2024). Can We Get Rid of Bias? Mitigating Systematic Error in Data Donation Studies through Survey Design Strategies. Computational Communication Research, 6(2), 1. https://doi.org/10.5117/CCR2024.2.2.HASE
Keusch, F., Pankowska, P. K., Cernat, A., & Bach, R. L. (2024). Do You Have Two Minutes to Talk about Your Data? Willingness to Participate and Nonparticipation Bias in Facebook Data Donation. Field Methods, 36(4), 279–293. https://doi.org/10.1177/1525822X231225907
Kmetty, Z., & Stefkovics, Á. (2025). Validating a willingness to share measure of a vignette experiment using real-world behavioral data. Scientific Reports, 15(1), 9319. https://doi.org/10.1038/s41598-025-92349-2
Kohne, J., & Montag, C. (2024). ChatDashboard: A Framework to collect, link, and process donated WhatsApp Chat Log Data. Behavior Research Methods, 56(4), 3658–3684.
Pak, C., Cotter, K., & Thorson, K. (2022). Correcting Sample Selection Bias of Historical Digital Trace Data: Inverse Probability Weighting (IPW) and Type II Tobit Model. Communication Methods and Measures, 16(2), 134–155. https://doi.org/10.1080/19312458.2022.2037537
Pfiffner, N., Witlox, P., & Friemel, T. N. (2022). Data Donation Module. https://github.com/uzh/ddm
TeBlunthuis, N., Hase, V., & Chan, C.-H. (2024). Misclassification in Automated Content Analysis Causes Bias in Regression. Can We Fix It? Yes We Can! Communication Methods and Measures, 18(3), 278–299. https://doi.org/10.1080/19312458.2023.2293713